Why is this relevant?
In learning R today, we’ll take an approach, used in R for Data Science (Wickham & Grolemund, 2016), diagrammed like this:
One of the strengths of R: every step of this workflow can be done using it!
https://www.rstudio.com/products/rstudio/download3/#download
install.packages('tidyverse')
library(tidyverse)A package only needs to be installed once (per major version, e.g. 3.3.x to 3.4.x), but must be loaded every time.
Data on the fuel efficiency of 38 models of cars between 1999 and 2008 from the US EPA:
mpg## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
aes)ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))hwy vs cyl.class vs drv?ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = cyl))ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))What’s going on here? Is this useful? How might we make it more useful?
What about those outliers? Use the color aesthetic.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))What’s happening here?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")What’s happened here? What colour are the points? Why?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))mpg variable types: which are categorical, which are continuous??mpgaes(color = displ < 5)?ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))## `geom_smooth()` using method = 'loess'
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))## `geom_smooth()` using method = 'loess'
?geom_smooth
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv, linetype = drv))## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
A common scenario:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")1 / 200 * 30## [1] 0.15
(59 + 73 + 2) / 3## [1] 44.66667
sin(pi / 2)## [1] 1
x <- 3 * 4Calling R functions, in general:
function_name(arg1 = val1, arg2 = val2, ...)
An example function, seq:
seq(1, 10)## [1] 1 2 3 4 5 6 7 8 9 10
dplyr and tidyr (tidyverse)We’ll look at CANSIM Tables 0477-0058 and 0477-0059 which cover university revenues and expenditures, respectively.
To save time, I’ve already downloaded the data from CANSIM. You can get it here:
We’ll use read_csv() from readr, which I prefer to the base R function, as it has some more sensible defaults, and it’s faster.
# http://www20.statcan.gc.ca/tables-tableaux/cansim/csv/04770058-eng.zip
# make sure the zip file is in the same directory as your RStudio project
revenue_raw <- read_csv('04770058-eng.zip') ## Parsed with column specification:
## cols(
## Ref_Date = col_character(),
## GEO = col_character(),
## SCHOOL = col_character(),
## REVENUE = col_character(),
## FUND = col_character(),
## Vector = col_character(),
## Coordinate = col_character(),
## Value = col_double()
## )
The read_csv() function tells us that it guessed the types of the various columns. In this situation, the default guesses are fine, but of course we can force it to treat columns certain ways if we wish.
revenue_raw## # A tibble: 122,240 x 8
## Ref_Date GEO SCHOOL REVENUE
## <chr> <chr> <chr> <chr>
## 1 2000/2001 Canada Total universities and colleges Total revenues
## 2 2001/2002 Canada Total universities and colleges Total revenues
## 3 2002/2003 Canada Total universities and colleges Total revenues
## 4 2003/2004 Canada Total universities and colleges Total revenues
## 5 2004/2005 Canada Total universities and colleges Total revenues
## 6 2005/2006 Canada Total universities and colleges Total revenues
## 7 2006/2007 Canada Total universities and colleges Total revenues
## 8 2007/2008 Canada Total universities and colleges Total revenues
## 9 2008/2009 Canada Total universities and colleges Total revenues
## 10 2009/2010 Canada Total universities and colleges Total revenues
## # ... with 122,230 more rows, and 4 more variables: FUND <chr>,
## # Vector <chr>, Coordinate <chr>, Value <dbl>
We have 8 columns and 122,240 rows of data. read_csv() brings the data in as a tibble, which is just an R “data frame”, but with some handy defaults, some of which we’re seeing here. For instance, it gives us the size of the data frame in rows and columns, the types of the columns (e.g., “<chr>” for character) and only prints the first 10 rows, instead of overwhelming us with all of the data.
View(revenue_raw) # in RStudioOr click the icon in the Environment tab.
head(revenue_raw, 1) # show just the first row(See Wickham, 2014 and R for Data Science, chapter 12)
Is the CANSIM data tidy?
filter()).arrange()).select()).mutate()).Collapse many values down to a single summary (summarise()).
All can be used in conjunction with group_by()
https://www.rstudio.com/resources/cheatsheets/
(or Help | Cheatsheets in RStudio!)
filter() the rows we wantfilter(revenue_raw,
SCHOOL == 'Total universities and colleges',
FUND == 'Total funds (x 1,000)')## # A tibble: 5,632 x 8
## Ref_Date GEO SCHOOL REVENUE
## <chr> <chr> <chr> <chr>
## 1 2000/2001 Canada Total universities and colleges Total revenues
## 2 2001/2002 Canada Total universities and colleges Total revenues
## 3 2002/2003 Canada Total universities and colleges Total revenues
## 4 2003/2004 Canada Total universities and colleges Total revenues
## 5 2004/2005 Canada Total universities and colleges Total revenues
## 6 2005/2006 Canada Total universities and colleges Total revenues
## 7 2006/2007 Canada Total universities and colleges Total revenues
## 8 2007/2008 Canada Total universities and colleges Total revenues
## 9 2008/2009 Canada Total universities and colleges Total revenues
## 10 2009/2010 Canada Total universities and colleges Total revenues
## # ... with 5,622 more rows, and 4 more variables: FUND <chr>,
## # Vector <chr>, Coordinate <chr>, Value <dbl>
filter() help tells us that everything after the first argument (i.e., the “…”) are: “Logical predicates defined in terms of the variables in .data. Multiple conditions are combined with &.”
> # greater than
>= # greater than or equal to
< # less than
<= # less than or equal to
!= # not equal
== # equal
%in% # not quite a comparison operator, but handy
# x %in% y is true if the values of vector x are present in vector yOther operators we can use with filter():
Complete set of boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.
filter()In other words, the previous command is equivalent to this:
filter(revenue_raw,
SCHOOL == 'Total universities and colleges' &
FUND == 'Total funds (x 1,000)')## # A tibble: 5,632 x 8
## Ref_Date GEO SCHOOL REVENUE
## <chr> <chr> <chr> <chr>
## 1 2000/2001 Canada Total universities and colleges Total revenues
## 2 2001/2002 Canada Total universities and colleges Total revenues
## 3 2002/2003 Canada Total universities and colleges Total revenues
## 4 2003/2004 Canada Total universities and colleges Total revenues
## 5 2004/2005 Canada Total universities and colleges Total revenues
## 6 2005/2006 Canada Total universities and colleges Total revenues
## 7 2006/2007 Canada Total universities and colleges Total revenues
## 8 2007/2008 Canada Total universities and colleges Total revenues
## 9 2008/2009 Canada Total universities and colleges Total revenues
## 10 2009/2010 Canada Total universities and colleges Total revenues
## # ... with 5,622 more rows, and 4 more variables: FUND <chr>,
## # Vector <chr>, Coordinate <chr>, Value <dbl>
I prefer this notation, as it’s more explicit.
filter()But, one more thing: we need to assign the return value of the filter() function back to a variable:
revenue <- filter(revenue_raw,
SCHOOL == 'Total universities and colleges' &
FUND == 'Total funds (x 1,000)')Shortcut for assignment operator: Alt-Hyphen in RStudio
A new variable for each step gets cumbersome, so dplyr provides an operator, the pipe (%>%) that combines operations:
revenue_long <- revenue_raw %>%
# only rows matching this
filter(SCHOOL == 'Total universities and colleges' &
FUND == 'Total funds (x 1,000)') %>%
# remove these columns
select(-SCHOOL, -FUND, -Vector, -Coordinate) %>%
# fix up the date column
mutate(Ref_Date = as.integer(stringr::str_sub(Ref_Date, 1, 4)))x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) etc.
Shortcut for pipe operator: Ctrl-Shift-M in RStudio
head(revenue_long, 1) # looks good so far!## # A tibble: 1 x 4
## Ref_Date GEO REVENUE Value
## <int> <chr> <chr> <dbl>
## 1 2000 Canada Total revenues 16224715
starts_with("abc"): matches names that begin with “abc”.ends_with("xyz"): matches names that end with “xyz”.contains("ijk"): matches names that contain “ijk”.matches("(.)\\1"): selects variables that match a regular expression.num_range("x", 1:3): matches x1, x2 and x3.one_of(vector): columns whose names are in said vector.tidyrThe two main tidyr functions:
gather(data, key, value): Moves column names into a key column, with the values going into a single value column.spread(data, key, value): Moves unique values of key column into column names, with values from the value column.tidyr Cheatsheethttps://www.rstudio.com/resources/cheatsheets/ (Data Import Cheatsheet)
spread() CANSIMrevenue <- revenue_long %>%
spread(REVENUE, Value)
revenue## # A tibble: 176 x 34
## Ref_Date GEO `Canada Foundation for Innovation`
## * <int> <chr> <dbl>
## 1 2000 Alberta 8208
## 2 2000 British Columbia 5189
## 3 2000 Canada 212418
## 4 2000 Manitoba 7170
## 5 2000 New Brunswick 1745
## 6 2000 Newfoundland and Labrador 2799
## 7 2000 Nova Scotia 3234
## 8 2000 Ontario 91066
## 9 2000 Prince Edward Island 355
## 10 2000 Quebec 75988
## # ... with 166 more rows, and 31 more variables: `Canada Research
## # Chairs` <dbl>, `Canadian Institute of Health Research` <dbl>, `Credit
## # courses tuiton` <dbl>, `Donations made by business enterprises` <dbl>,
## # `Donations made by individuals` <dbl>, `Donations made by
## # not-for-profit organizations` <dbl>, Endowment <dbl>, Federal <dbl>,
## # Foreign <dbl>, `Grants made by business enterprises` <dbl>, `Grants
## # made by Individuals` <dbl>, `Grants made by not-for-profit
## # organizations` <dbl>, `Health Canada` <dbl>, Investments <dbl>,
## # Miscellaneous <dbl>, Municipal <dbl>, `Natural Sciences and
## # Engineering Research Council` <dbl>, `Non-credit tution` <dbl>,
## # `Non-federal` <dbl>, `Other federal` <dbl>, `Other fees` <dbl>, `Other
## # investments` <dbl>, `Other provinces` <dbl>, `Other revenue
## # type` <dbl>, Provincial <dbl>, `Sale of service and products` <dbl>,
## # `Social Sciences and Humanities Research Council` <dbl>, `Total
## # donations` <dbl>, `Total grants` <dbl>, `Total revenues` <dbl>,
## # `Tuition and other fees` <dbl>
revenue %>%
filter(GEO == 'Canada') %>%
# pass the result to ggplot() as the first argument
ggplot(aes(Ref_Date, `Total revenues`)) +
# now it switches to + to combine, which is ggplot's way
geom_line()Not the prettiest, but it works!
revenue %>%
filter(GEO == 'Canada') %>%
ggplot(aes(Ref_Date, `Total revenues`)) +
geom_line() +
labs(title = 'Total University and Degree-Granting College Revenue',
x = 'Fiscal Year',
y = 'Revenue ($ thousands)') +
scale_y_continuous(labels = scales::comma)revenue %>%
filter(GEO %in% c('Canada', 'Alberta', 'Ontario')) %>%
ggplot(aes(Ref_Date, `Tuition and other fees` / `Total revenues`, color = GEO)) +
geom_line() +
labs(title = 'Tuition and fees as a share of total revenue') +
scale_y_continuous(labels = scales::percent)dplyr “verbs”Which province took in the most tuition revenue in 2010?
revenue %>%
filter(GEO != 'Canada', Ref_Date == 2010) %>%
select(GEO, `Tuition and other fees`) %>%
arrange(desc(`Tuition and other fees`))## # A tibble: 10 x 2
## GEO `Tuition and other fees`
## <chr> <dbl>
## 1 Ontario 3563011
## 2 British Columbia 1009476
## 3 Quebec 825943
## 4 Alberta 711973
## 5 Nova Scotia 305597
## 6 Manitoba 176165
## 7 Saskatchewan 173600
## 8 New Brunswick 146213
## 9 Newfoundland and Labrador 61537
## 10 Prince Edward Island 27878
revenue %>%
select(Ref_Date, GEO, `Tuition and other fees`, `Total revenues`) %>%
mutate(tuition_share = `Tuition and other fees` / `Total revenues`)## # A tibble: 176 x 5
## Ref_Date GEO `Tuition and other fees`
## <int> <chr> <dbl>
## 1 2000 Alberta 298464
## 2 2000 British Columbia 321017
## 3 2000 Canada 3052960
## 4 2000 Manitoba 99932
## 5 2000 New Brunswick 79153
## 6 2000 Newfoundland and Labrador 49954
## 7 2000 Nova Scotia 173980
## 8 2000 Ontario 1509442
## 9 2000 Prince Edward Island 13778
## 10 2000 Quebec 407458
## # ... with 166 more rows, and 2 more variables: `Total revenues` <dbl>,
## # tuition_share <dbl>
Keeps the same number of rows
+, -, *, /, ^%/% (integer division), %% (remainder)log(), log2(), log10()lead(), lag()cumsum(), cumprod(), cummin(), cummax(), cummean()<, <=, >, >=, !=, ==min_rank(), row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()ifelse()recode()case_when()revenue %>%
group_by(GEO) %>%
summarise(avg_endowment_revenue = mean(Endowment)) %>%
arrange(-avg_endowment_revenue)## # A tibble: 11 x 2
## GEO avg_endowment_revenue
## <chr> <dbl>
## 1 Canada 526070.7500
## 2 Ontario 211474.5625
## 3 Quebec 87322.9375
## 4 Alberta 80069.0000
## 5 British Columbia 75751.6250
## 6 Nova Scotia 31147.7500
## 7 Saskatchewan 21255.5625
## 8 New Brunswick 13778.1875
## 9 Newfoundland and Labrador 2877.6875
## 10 Manitoba 1646.5000
## 11 Prince Edward Island 746.9375
mean(x), median(x)sd(x), IQR(x) (interquartile range), mad(x) (median absolute deviation)min(x), max(x), quantile(x, 0.25)first(x), nth(x, 2), last(x)n(x), n_distinct(x)sum(x > 10), mean(y == 0)
TRUE is converted to 1 and FALSE to 0Compute the total tuition revenue of the three western-most provinces in 2007.
revenue %>%
filter(GEO %in% c('British Columbia', 'Alberta', 'Saskatchewan'), Ref_Date == 2007) %>%
summarise(sum(`Tuition and other fees`))## # A tibble: 1 x 1
## `sum(\`Tuition and other fees\`)`
## <dbl>
## 1 1374509
These three provinces compared to all other provinces and national average.
revenue %>%
filter(Ref_Date == 2007) %>%
mutate(category = case_when(
GEO %in% c('British Columbia', 'Alberta', 'Saskatchewan') ~ '3 Western Provinces',
GEO == 'Canada' ~ GEO,
TRUE ~ 'All Other Provinces'
)) %>%
group_by(category) %>%
summarise(sum(`Tuition and other fees`))## # A tibble: 3 x 2
## category `sum(\`Tuition and other fees\`)`
## <chr> <dbl>
## 1 3 Western Provinces 1374509
## 2 All Other Provinces 4079866
## 3 Canada 5454375
tuition_relative_change <- revenue %>%
arrange(Ref_Date, GEO) %>%
group_by(GEO) %>%
mutate(rel_change = (`Tuition and other fees` - first(`Tuition and other fees`)) / first(`Tuition and other fees`)) %>%
select(Ref_Date, GEO, `Tuition and other fees`, rel_change)
ggplot(tuition_relative_change, aes(Ref_Date, rel_change, color = GEO)) +
geom_line()Count years with positive endowment income by province
revenue %>%
group_by(GEO) %>%
summarise(num_pos_endow = sum(Endowment > 0))## # A tibble: 11 x 2
## GEO num_pos_endow
## <chr> <int>
## 1 Alberta 13
## 2 British Columbia 13
## 3 Canada 13
## 4 Manitoba 14
## 5 New Brunswick 13
## 6 Newfoundland and Labrador 13
## 7 Nova Scotia 14
## 8 Ontario 12
## 9 Prince Edward Island 13
## 10 Quebec 16
## 11 Saskatchewan 12
purrr (quickly)revenue %>%
filter(GEO != 'Canada') %>%
group_by(Ref_Date) %>%
summarise(revenue_quantiles = list(quantile(`Total revenues`, c(0.25, 0.5, 0.75)))) %>%
mutate(
low_25 = map_dbl(revenue_quantiles, "25%"),
mid = map_dbl(revenue_quantiles, '50%'),
high_75 = map_dbl(revenue_quantiles, '75%')
)## # A tibble: 16 x 5
## Ref_Date revenue_quantiles low_25 mid high_75
## <int> <list> <dbl> <dbl> <dbl>
## 1 2000 <dbl [3]> 424965.0 681741.5 1944551
## 2 2001 <dbl [3]> 417664.5 700424.0 2126550
## 3 2002 <dbl [3]> 465112.5 729434.0 2378510
## 4 2003 <dbl [3]> 480403.8 804636.5 2708008
## 5 2004 <dbl [3]> 530995.2 853131.5 2818174
## 6 2005 <dbl [3]> 560023.2 911819.0 3257599
## 7 2006 <dbl [3]> 616411.5 964536.5 3320636
## 8 2007 <dbl [3]> 592926.8 944042.5 3560364
## 9 2008 <dbl [3]> 568460.8 968932.5 3404852
## 10 2009 <dbl [3]> 741015.5 1161066.0 4363439
## 11 2010 <dbl [3]> 787081.5 1213456.0 4436956
## 12 2011 <dbl [3]> 775058.8 1169906.5 4177081
## 13 2012 <dbl [3]> 799796.8 1240542.0 4296184
## 14 2013 <dbl [3]> 791843.2 1298845.0 4437751
## 15 2014 <dbl [3]> 799297.2 1316785.0 4623504
## 16 2015 <dbl [3]> 770570.8 1261935.5 4451983
NA > 5## [1] NA
10 == NA## [1] NA
NA + 10## [1] NA
What about this?
NA / 2
x <- NA
is.na(x)## [1] TRUE
filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:
test_data <- tibble(x = c(1, NA, 3))
filter(test_data, x > 1)## # A tibble: 1 x 1
## x
## <dbl>
## 1 3
filter(test_data, is.na(x) | x > 1)## # A tibble: 2 x 1
## x
## <dbl>
## 1 NA
## 2 3
c(1, 2, 3)## [1] 1 2 3
1:3## [1] 1 2 3
1:3 + 1:9## [1] 2 4 6 5 7 9 8 10 12
What about…
1:3 + 1:10
1:3 + 1:10## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length
## [1] 2 4 6 5 7 9 8 10 12 11
What happened?
mean(c(1,2,3))## [1] 2
What do we expect here?
mean(c(1,NA,3))
mean(c(1,NA,3))## [1] NA
mean(c(1,NA,3), na.rm = TRUE)## [1] 2
c(x = 1, y = 2, z = 4)## x y z
## 1 2 4
set_names(1:3, c("a", "b", "c"))## a b c
## 1 2 3
# positive values
x <- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]## [1] "three" "two" "five"
# negative values
x[c(-1, -3, -5)]## [1] "two" "four"
#> [1] "two" "four"
# subset a named vector w/character vector
x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]## xyz def
## 5 2
x <- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
x[!is.na(x)]## [1] 10 3 5 8 1
#> [1] 10 3 5 8 1
# All even (or missing!) values of x
x[x %% 2 == 0]## [1] 10 NA 8 NA
#> [1] 10 NA 8 NAx <- list(1, 2, 3)
x## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
str(x) # str...ucture## List of 3
## $ : num 1
## $ : num 2
## $ : num 3
Can contain different types of objects:
y <- list("a", 1L, 1.5, TRUE)
str(y)## List of 4
## $ : chr "a"
## $ : int 1
## $ : num 1.5
## $ : logi TRUE
Or other lists:
z <- list(list(1, 2), list(3, 4))
str(z)## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ :List of 2
## ..$ : num 3
## ..$ : num 4
Start here:
Additional:
For help/community: